@huaxingao
Contributor

@huaxingao huaxingao commented May 29, 2020

What changes were proposed in this pull request?

Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference

Why are the changes needed?

To make the SQL reference complete.

Does this PR introduce any user-facing change?

Screen Shot 2020-05-29 at 6 46 38 PM

Screen Shot 2020-05-29 at 6 43 30 PM

Only the above pages are changed. The following two pages are the same as before.

Screen Shot 2020-05-28 at 10 05 27 PM

Screen Shot 2020-05-28 at 10 05 08 PM

How was this patch tested?

Manually built and checked.

@huaxingao
Contributor Author

Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference per @gatorsmile request.
cc @maropu @dilipbiswal @ulysses-you @jzhuge @xuanyuanking

@SparkQA

SparkQA commented May 29, 2020

Test build #123263 has finished for PR 28672 at commit 0b3e765.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented May 29, 2020

This is for 3.0? Btw, could you assign a new jira ID to this PR?

```sql
/*+ hint [ , ... ] */
```

### Coalesce/Repartition/Repartition_By_Range Hints

Member

How about simply saying ### Partitioning Hints here?


### Examples
```sql
SELECT /*+ COALESCE(3) */ * FROM t;
```

Member

How about showing a spark plan via explain?


### Coalesce/Repartition/Repartition_By_Range Hints

Coalesce/Repartition/Repartition_By_Range hints have functionalities equivalent to those of the

Member

Could you follow the same format with the Join hints? e.g., Coalesce -> `COALESCE`

@huaxingao huaxingao changed the title [SPARK-31333][SQL][DOCS][FOLLOW-UP] Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference [SPARK-31866][SQL][DOCS] Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference May 29, 2020
@huaxingao
Contributor Author

Yes, it's for 3.0. I created jira SPARK-31866. @maropu

@maropu maropu changed the title [SPARK-31866][SQL][DOCS] Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference May 29, 2020
@SparkQA

SparkQA commented May 29, 2020

Test build #123271 has finished for PR 28672 at commit 60fdb93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct<name:string>

SELECT /*+ REPARTITION(3) */ * FROM t;

Member

Do we still need these statements, which have no output, as examples?

Member

One more comment; probably, the join hint section should have the same format for the examples.

Contributor Author

I think I will only have one example for explain. Otherwise the example section will be too long.
I will leave the join hint example section as is for now. Don't want this section to be too long.


### Join Hints

Join Hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, `SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Join Hints support was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes hints in the following order: `BROADCAST` over `MERGE` over `SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Spark is not guaranteed to use the join strategy suggested by the hint.

Member

Hints -> hints?
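
The precedence rule in the Join Hints paragraph quoted above can be made concrete with a small sketch. This is only a toy model of the documented ordering, not Spark's planner code; `JOIN_HINT_PRIORITY` and `pick_join_hint` are hypothetical names introduced for illustration.

```python
# Toy model of the documented join-hint precedence; not Spark's implementation.
# Earlier entries win over later ones.
JOIN_HINT_PRIORITY = ["BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL"]


def pick_join_hint(left_hint, right_hint):
    """Return the hint that wins when each join side may carry a strategy hint.

    Either argument may be None, meaning no hint was given on that side.
    """
    candidates = [h.upper() for h in (left_hint, right_hint) if h is not None]
    for hint in candidates:
        if hint not in JOIN_HINT_PRIORITY:
            raise ValueError(f"unknown join hint: {hint}")
    if not candidates:
        return None
    # The candidate with the smallest index has the highest priority.
    return min(candidates, key=JOIN_HINT_PRIORITY.index)


print(pick_join_hint("SHUFFLE_HASH", "MERGE"))       # MERGE
print(pick_join_hint("SHUFFLE_REPLICATE_NL", None))  # SHUFFLE_REPLICATE_NL
```

As the paragraph notes, the winning hint is still only a suggestion: Spark may use a different strategy when the hinted one does not support the join type.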


### Partitioning Hints

`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities equivalent to those of the

Member

How about rephrasing it like this?


Partitioning hints allow users to suggest a partitioning strategy that Spark should follow. COALESCE, REPARTITION, and REPARTITION_BY_RANGE hints are supported and they are equivalent to coalesce, repartition, and repartitionByRange Dataset APIs, respectively.

Member

Also, could you add links to the Dataset APIs if we describe them here? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

### Partitioning Hints

`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities equivalent to those of the
`Dataset` `coalesce`/`repartition`/`repartitionByRange` APIs. The `COALESCE` hint can be used to reduce

Member

How about moving the explanations for each hint (e.g., The COALESCE hint can be used to reduce...) into a new section like ### Partitioning Hints Types?

@SparkQA

SparkQA commented May 29, 2020

Test build #123294 has finished for PR 28672 at commit 8a7fa09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123299 has finished for PR 28672 at commit 7f97fe3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

and `REPARTITION_BY_RANGE` hints are supported and are equivalent to `coalesce`, `repartition`, and
`repartitionByRange` [Dataset APIs](api/scala/org/apache/spark/sql/Dataset.html), respectively. These hints give users
a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are
specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.

Member

Nit: The description of multiple hints is duplicated in https://github.com/apache/spark/pull/28672/files#diff-84ec3ee2cc31db6fd14e15058e35435cR69, maybe we just keep the one with the example.

Contributor Author

Thanks for your comment. I will keep the one in description.
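
The "leftmost hint is picked" rule discussed in this thread can be sketched as follows. This is a toy parser over the `/*+ hint [ , ... ] */` syntax, not Spark's SQL parser or optimizer; the helper names, the sample query, and the column `c` are assumptions made for illustration.

```python
import re

# Matches the hint comment: /*+ hint [ , ... ] */
HINT_COMMENT = re.compile(r"/\*\+(.*?)\*/", re.DOTALL)
# Matches one hint with arguments, e.g. COALESCE(3) or REPARTITION_BY_RANGE(3, c)
HINT_CALL = re.compile(r"([A-Za-z_]+)\s*\(([^)]*)\)")

PARTITIONING_HINTS = {"COALESCE", "REPARTITION", "REPARTITION_BY_RANGE"}


def partitioning_hints(sql):
    """Return [(name, [args])] for the partitioning hints, in the order written."""
    match = HINT_COMMENT.search(sql)
    if match is None:
        return []
    found = []
    for name, raw_args in HINT_CALL.findall(match.group(1)):
        if name.upper() in PARTITIONING_HINTS:
            args = [a.strip() for a in raw_args.split(",") if a.strip()]
            found.append((name.upper(), args))
    return found


def effective_hint(sql):
    """Toy version of the rule in the text: the leftmost partitioning hint wins."""
    hints = partitioning_hints(sql)
    return hints[0] if hints else None


# Hypothetical query with three partitioning hints on table t.
query = "SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t"
print(effective_hint(query))  # ('REPARTITION', ['100'])
```

In Spark itself, each hint still produces a node in the logical plan, as the quoted text says; the sketch models only which hint ends up taking effect.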

a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are
specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.

### Partitioning Hints Types

Member

#### Partitioning Hints Types?

The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.


### Examples

Member

Ditto, #### Examples


The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.


Member

nit: remove the unnecessary blank line.
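
For readers unfamiliar with range partitioning, the idea behind the quoted `REPARTITION_BY_RANGE` description can be sketched in plain Python. This is a simplified model under stated assumptions: it sorts all rows and cuts them into contiguous, roughly equal ranges, and it does not try to reproduce how Spark actually chooses range boundaries. The function name, the rows, and the column `c` are hypothetical.

```python
def repartition_by_range(rows, key, num_partitions):
    """Split rows into contiguous, ordered key ranges of roughly equal size.

    Simplification: all rows are sorted and cut at fixed offsets, so equal
    keys may land in adjacent partitions.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    ordered = sorted(rows, key=lambda r: r[key])
    size, extra = divmod(len(ordered), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(ordered[start:end])
        start = end
    return partitions


rows = [{"c": v} for v in [7, 1, 9, 4, 3, 8]]
for part in repartition_by_range(rows, "c", 3):
    print([r["c"] for r in part])
# [1, 3]
# [4, 7]
# [8, 9]
```

Each partition holds a contiguous range of `c` values, which is the property that distinguishes range partitioning from hash-based repartitioning.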

@SparkQA

SparkQA commented May 30, 2020

Test build #123305 has finished for PR 28672 at commit bc4fdfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Looks okay

Member

@xuanyuanking xuanyuanking left a comment

LGTM

@huaxingao
Contributor Author

cc @srowen This is for 3.0. Thank you!

@srowen srowen closed this in 1b780f3 May 30, 2020
srowen pushed a commit that referenced this pull request May 30, 2020
…E Hints to SQL Reference

Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference

To make SQL reference complete

<img width="1100" alt="Screen Shot 2020-05-29 at 6 46 38 PM" src="https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png">

<img width="1099" alt="Screen Shot 2020-05-29 at 6 43 30 PM" src="https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png">

Only the above pages are changed. The following two pages are the same as before.

<img width="1100" alt="Screen Shot 2020-05-28 at 10 05 27 PM" src="https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png">

<img width="1099" alt="Screen Shot 2020-05-28 at 10 05 08 PM" src="https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png">

Manually build and check

Closes #28672 from huaxingao/coalesce_hint.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 1b780f3)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented May 30, 2020

Merged to master/3.0. 3.0 had a very minor-looking merge conflict which I resolved directly.

@huaxingao
Contributor Author

Thanks! @srowen @maropu @xuanyuanking

@huaxingao huaxingao deleted the coalesce_hint branch May 30, 2020 20:22